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Abstract 

A  unified  framework  for  3-D  shape  reconstruction  allows  us  to  combine  image-based  and 
geometry-based  information  sources.  The  image  information  is  akin  to  stereo  and  shape-from- 
shading,  while  the  geometric  information  may  be  provided  in  the  form  of  3-D  points,  3-D 
features  or  2-D  silhouettes.  A  formal  integration  framework  is  critical  in  recovering  complicated 
surfaces  because  the  information  from  a  single  source  is  often  insufficient  to  provide  a  unique 
answer. 

Our  approach  to  shape  recovery  is  to  deform  a  generic  object^centered  3-D  representation 
of  the  surface  so  as  to  minimize  an  objective  function.  This  objective  function  is  a  weighted 
sum  of  the  contributions  of  the  various  information  sources.  We  describe  these  various  terms 
individually,  our  weighting  scheme,  and  our  optimization  method.  Finally,  we  present  results  on 
a  number  of  difficult  images  of  real  scenes  for  which  a  single  source  of  information  would  have 
proved  insufficient. 


Keywords  :  Surface  reconstruction,  Stereo,  Shape-from-shading, Silhouettes,  Geometric 
straints. 
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1  Introduction 


The  recovering  of  surface  shape  from  image  cues,  the  so-called  “shape  from  X”  problem,  has  received 
tremendous  attention  in  the  computer  vision  community.  But  no  single  source  of  information  “X,” 
be  it  stereo,  shading,  texture,  geometric  constraints  or  any  other,  has  proved  to  be  sufficient  across 
a  reasonable  sampling  of  images.  To  get  good  reconstructions  of  a  surfcice,  it  is  necessary  to  use 
as  many  different  kinds  of  cues  with  as  many  views  of  the  surface  as  possible.  In  this  paper,  we 
present  and  demonstrate  a  working  framework  for  surface  reconstruction  that  combines  image  cues, 
such  as  stereo  and  shape-from-shading,  with  geometric  constraints,  such  as  those  provided  by  laser 
range  finders,  area-  and  edge-based  stereo  algorithms,  linear  features,  and  silhouettes. 

Our  framework  can  incorporate  cues  from  many  images  of  a  surface,  even  when  the  images  are 
taken  from  widely  differing  viewpoints,  accommodating  such  viewpoint-dependent  effects  as  self¬ 
occlusion  and  self-shadowing.  It  accomplishes  this  by  using  a  full  3-D  object-centered  representation 
of  the  estimated  surface.  This  representation  is  then  used  to  generate  synthetic  views  of  the 
estimated  surfcice  from  the  viewpoint  of  each  input  image.  By  using  standard  computer  graphics 
algorithms,  those  parts  of  the  surface  that  are  hidden  from  a  given  viewpoint  can  be  identified 
and  consequently  eliminated  from  the  reconstruction  process.  The  remaining  parts  are  then  in 
correspondence  with  the  input  images,  and  the  images  and  corresponding  cues  are  applied  to  the 
reconstruction  of  the  surface  in  an  iterative  manner  using  an  optimization  algorithm. 

Recent  publications  describe  the  reconstruction  of  a  surface  using  3-D  object-centered  repre¬ 
sentations,  such  as  2.1/2-D  grids  [Robert  et  o/.,  1992],  3-D  surface  meshes  [Cohen  et  ai,  1991, 
Delingette  et  ai,  1991,  Terzopoulos  and  Vasilescu,  1991,  Vemuri  and  Malladi,  1991,  Mclner- 
ney  and  Terzopoulos,  1993,  Koh  et  ai,  1994],  parameterized  surfcices  [Stokely  and  Wu,  1992, 
Lowe,  1991],  local  surfaces  [Ferrie  et  al.,  1992,  Fua  and  Sander,  1992],  particle  systems  [Szeliski 
and  Tonnesen,  1992],  and  volumetric  models  [Pentland,  1990,  Terzopoulos  and  Metaxas,  1991, 
Pentland  and  Sdaroff,  1991],  Most  of  these  rely  on  previously  computed  3-D  data,  such  as  the 
coordinates  of  points  derived  from  laser  range  finders  or  correlation-based  stereo  algorithms,  and 
reconstruct  the  surfcice  by  fitting  it  to  these  data  in  a  least-squares  sense.  In  other  words,  the 
derivation  of  the  3-D  data  from  the  images  is  completely  divorced  from  the  reconstruction  of  the 
surfeice. 

In  contrast,  our  framework  allows  us  to  directly  use  such  image  cues  as  stereo,  shading,  and 
silhouette  edges  in  the  reconstruction  process  while  simultaneously  incorporating  previously  com¬ 
puted  3-D  data  such  as  those  mentioned  above.  In  a  previous  publication  [Fua  and  Leclerc,  1994]  we 
describe  how  stereo  and  shading  are  used  within  the  framework  described  below,  and  the  relation¬ 
ship  of  this  approach  to  previous  work.  Here,  we  focus  on  how  an  additional  image  cue  {silhouette 
edges)  and  previously  computed  3-D  data  are  incorporated  into  our  reconstruction  process. 

Combining  these  different  sources  of  information  is  not  a  new  idea  in  itself.  For  example,  Blake  et 
al.  [1985]  is  the  earliest  reference  we  are  aware  of  that  discusses  the  complementary  nature  of  stereo 
and  shape-from-shading.  Both  Cryer  et  al.  [1992]  and  Heipke  et  al.  [1992]  have  proposed  algorithms 
to  combine  shape-from-shading  and  stereo,  while  Liedtke  et  al.  [l99l]  first  uses  silhouettes  to  derive 
an  initial  estimate  of  the  surface,  and  then  applies  a  multi-image  stereo  algorithm  to  improve  the 
result.  However,  none  of  the  algorithms  we  know  of  uses  an  object-centered  representation  and 
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an  optimization  procedure  that  are  general  enough  to  incorporate  all  of  the  cues  that  we  present 
here.  This  generality  should  also  make  possible  the  use  of  a  very  wide  range  of  other  sources  of 
information,  such  as  shadows,  in  addition  to  those  actually  discussed  here. 

We  view  the  contribution  of  this  paper  as  providing  both  the  framework  that  allows  us  to 
combine  diverse  sources  of  information  in  a  unified  and  computationally  effective  manner,  and  the 
specific  details  of  how  these  diverse  sources  of  information  are  derived  from  the  images. 

In  the  next  section,  we  describe  our  framework  and  the  new  information  sources  introduced 
here.  Following  this,  we  demonstrate  that  the  framework  successfully  performs  its  function  on  real 
images  and  allows  us  to  achieve  results  that  are  better  than  those  we  could  derive  from  any  one, 
or  even  two,  sources  of  information. 

2  Framework 

Our  approach  to  recovering  surface  shape  and  reflectance  properties  from  multiple  images  is  to 
deform  a  3-D  representation  of  the  surface  so  as  to  minimize  an  objective  function.  The  free 
variables  of  this  objective  function  are  the  coordinates  of  the  vertices  of  the  mesh  representing  the 
surface,  and  the  process  is  started  with  an  initial  estimate  of  the  surface.  Here  we  assume  that 
images  are  monochrome,  and  that  their  camera  models  are  known  a  priori. 

We  represent  a  surface  5  by  a  hexagonally  connected  set  of  vertices  V  =  (wj ,  j;2,  ■  ■  • ,  called 
a  mesh.  The  position  of  vertex  Vj  is  specified  by  its  Cartesian  coordinates  Zj).  Each  vertex 

in  the  interior  of  the  surface  has  exactly  six  neighbors. 

Neighboring  vertices  are  further  organized  into  triangular  planar  surface  elements  called  facets, 
denoted  F  =  (/i,  /2,  ■  •  • ,  /n/)-  The  vertices  of  a  facet  are  also  ordered  in  a  clockwise  fashion.  In  this 
work,  we  require  that  the  initial  estimate  of  the  surface  have  facets  whose  sides  are  of  equal  length. 
The  objective  function  described  below  tends  to  maintain  this  equality,  but  does  not  strictly  enforce 
it.  In  the  course  of  the  optimization,  we  refine  the  mesh  by  iteratively  subdiving  the  facets  into 
four  smaller  ones  whose  sides  are  still  of  roughly  equal  length. 

In  Figure  1,  we  show  a  shsuded  view  and  a  wireframe  representation  of  such  a  mesh.  We  also 
show  what  we  call  a  “Facet-ID”  image.  For  each  input  image,  it  is  generated  by  encoding  the 
index  i  of  each  facet  /,•  as  a  unique  color,  and  projecting  the  surface  into  the  image  plane,  using 
a  standard  hidden-surface  algorithm.  As  discussed  in  Sections  2.3  and  2.4,  we  use  it  to  determine 
which  surface  points  are  occluded  in  a  given  view  and  on  which  facets  geometric  constraints  should 
be  brought  to  bear. 

2.1  Objective  Function  and  Optimization  Procedure 

The  objective  function  €{S)  that  we  use  to  recover  the  surface  is  a  sum  of  terms  that  take  into 
account  the  image-based  constraints — stereo  and  shape-from-shading — and  the  geometry-based 
constraints — features  and  silhouettes — that  are  brought  to  bear  on  the  surface.  To  minimize  ^{<S), 
we  use  an  optimization  method  that  is  inspired  by  the  heuristic  technique  known  as  a  continuation 
method  [Terzopoulos,  1986,  Leclerc,  1989a,  Leclerc,  1989b]  in  which  we  add  a  regularization  term 
to  the  objective  function  and  progressively  reduce  its  influence.  We  define  the  total  energy  of  the 
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Figure  1:  Projection  of  a  mesh,  and  the  Facet-ID  image  used  to  accommodate  occlusions  during  surface 
reconstruction:  (a)  A  shaded  image  of  a  mesh,  (b)  A  wire-frame  representation  of  the  mesh 
(bold  white  lines)  and  the  sample  points  in  each  facet  (interior  white  points),  (c)  The  Feicet- 
ID  image,  wherein  the  color  at  a  pixel  is  chosen  to  uniquely  identify  the  visible  facet  at  that 
point  (shown  here  as  a  gray-level  image). 

mesh,  as 


eriS)  =  XD€DiS)  +  £iS) 

£^S)  =  5A.-^.(5)  .  (1) 

I 

The  £i{S)  represent  the  image  and  geometry-based  constraints,  and  the  A,-  their  relative  weights, 
as  defined  below.  £d{S),  the  regularization  term,  serves  a  dual  purpose.  First,  we  define  it  as  a 
quadratic  function  of  the  vertex  coordinates,  so  that  it  “convexifies”  the  energy  landscape  when  Ad 
is  large  and  improves  the  convergence  properties  of  the  optimization  procedure.  Second,  as  shown 
in  the  appendix,  in  the  presence  of  noise,  some  amount  of  smoothing  is  required  to  prevent  the 
mesh  from  overfitting  the  data,  and  excessively  wrinkling  the  surface. 

In  our  implementation,  we  take  £d  to  be  a  measure  of  the  curvature  or  local  deviation  from  a 
plane  at  every  vertex.  We  approximate  this  as  follows. 

Consider  a  perfectly  planar  hexagonal  mesh  for  which  the  distances  between  neighboring  vertices 
are  exactly  equal.  Let  the  neighbors  of  a  vertex  V{  be  ordered  in  clockwise  fashion  and  let  us  denote 
them  for  1  <  i  <  6.  This  notation  is  depicted  in  Figure  2(a),  If  the  hexagonal  mesh  was 

perfectly  planar,  then  the  third  neighbor  over  from  the  neighbor,  vyv,(j+3),  would  lie  on  a  straight 
line  with  u,-  and  Given  that  the  intervertex  distances  are  equal,  this  implies  that  coordinates 

of  V,-  equal  the  average  of  the  coordinates  of  and  V;v,(j+3))  for  ‘‘■ny  i- 

Given  the  above,  we  can  write  a  measure  of  the  deviation  of  the  mesh  from  a  plane  as  follows: 
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Figure  2:  Vertices  and  facets  of  a  mesh:  (a)  The  six  neighbors  Ni  (j)  of  a  vertex  v,-  are  ordered  clockwise. 

The  deformation  component  of  the  objective  function  tends  to  minimize  the  distance  between 
Vi  and  the  midpoint  of  diametrically  opposed  neighbors,  represented  by  the  dotted  circle, 
(b)  Facets  are  sampled  at  regular  intervals  as  illustrated  here.  The  stereo  component  of  the 
objective  function  is  computed  by  summing  the  variance  of  the  gray  level  of  the  projections 
of  these  sample  points,  the  p,-s.  (c)  The  albedo  of  each  facet  is  estimated  using  the  facet 

normal  it,  the  light  source  direction  ,  and  the  average  gray  level  of  the  projection  of  the 
facet  into  the  images.  The  shading  component  of  the  objective  function  is  the  sum  of  the 
squared  differences  in  estimated  albedo  across  neighboring  facets. 


n„  3 

Sd{S)  =  '^  Yl  {‘ixi -- Xh  -  +  {2yi  -  yk  -  Vk’)'^  +  {2zi  -  Zk  -  Zk'f  (2) 

j=l 

k=Ni(j) 

k'=NiU+3) 


Note  that  this  term  is  also  equivalent  to  the  squared  directional  curvature  of  the  surface  when 
the  sides  have  equal  lengths  [Kass  et  al.,  1988].  This  term  can  be  made  to  accommodate  multiple 
resolutions  of  facets  by  normalizing  each  term  by  the  nominal  intervertex  spacing  of  the  facets. 

In  previous  implementations  [Fua  and  Leclerc,  1994],  we  have  performed  optimization  using  a 
standard  conjugate-gradient  descent  procedure  [Press  et  ai,  1986].  However,  the  Ed  term  described 
here  is  amenable  to  a  “snake-like”  optimization  technique  [Kass  et  al.,  1988].  We  embed  the  curve 
in  a  viscous  medium  and  solve  the  equation  of  dynamics 


OEt 

dS 


+  a 


dS 

dt 


with 


dEx 

w 


0, 

OEd  dE 
dS  ^  dS' 


(3) 


where  Ej  is  the  total  energy  of  Equation  1,  a  the  viscosity  of  the  medium,  and  S  the  state  vector 
that  defines  the  current  position  of  the  mesh  that  is  the  vector  of  the  x,y,  and  z  coordinates  of  the 
vertices.  Since  the  deformation  energy  Ed  in  Equation  2  is  quadratic,  its  derivative  with  respect  to 
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S  is  linear,  and  therefore  Equation  3  can  be  rewritten  as 


{Ks  +  0!l)Si 


_  d£ 

I5.-1 

-  as 


5,_, 


(4) 


where 


dEp 

dS 


=  KsS, 


and  Ks  is  a  sparse  matrix.  Note  that  the  derivatives  of  Ep  with  respect  to  x,y,  and  z  are  decoupled 
so  that  we  can  rewrite  Equation  4  as  a  set  of  three  differential  equations  in  the  three  spatial 
coordinates: 


iK  +  aI)Xt 
{K  +  aI)Yt 
(K  +  aI)Zt 


aXt.i  - 
aYi-i  - 
aZ<_i  - 


dE 

dx 

dE 

dY 

dE 

dZ 


lx,_, 

y.-i 

^t—i 


where  X,Y,  and  Z  are  the  vectors  of  the  x,y,  and  z  coordinates  of  the  vertices,  and  K  is  a 
sparse  matrix.  In  fact,  for  our  hexagonal  meshes,  K  turns  out  to  be  a  banded  matrix  and  this 
set  of  equations  can  be  computed  efficiently  using  LU  decomposition  and  backsubstitution.  Note 
that  the  LU  decomposition  need  be  recomputed  only  when  a  changes.  When  ol  is  constant,  only 
the  backsubstitution  step  is  required.  In  practice  ol  is  computed  automatically  at  the  start  of  the 
optimization  procedure  so  that  a  prespecified  average  vertex  motion  amplitude  is  achieved  [Fua  and 
Leclerc,  1990].  The  optimization  proceeds  as  long  as  the  total  energy  decreases;  when  it  increases 
the  algorithm  backtracks  and  increases  a,  thereby  decreasing  the  step  size. 

We  can  optimize  all  three  spatial  components  simultaneously.  However,  when  dealing  with 
surfaces  for  which  motion  in  one  direction  leads  to  more  dramatic  changes  than  motions  in  others, 
as  is  typically  the  case  with  the  z  direction  in  Digital  Elevation  Models  (DEMs),  we  have  found  the 
following  heuristic  to  be  useful.  We  first  fix  the  x  and  y  coordinates  of  vertices  and  adjust  z  alone. 
Once  the  surface  has  been  optimized,  we  then  allow  all  three  coordinates  to  vary. 

To  speed  the  computation  and  prevent  the  mesh  from  becoming  stuck  in  undesirable  local 
minima,  we  typically  use  several  levels  of  mesh  size — three  in  the  examples  of  Section  3 — to  perform 
the  computation.  We  start  with  a  relatively  coarse  mesh  that  we  optimize.  We  then  refine  it  by 
splitting  every  facet  into  four  smaller  ones  and  reoptimizing.  Finally,  we  repeat  the  split  and 
optimization  processes  one  more  time. 


2.2  Combining  the  Components 

The  total  energy  of  Equation  1  is  a  sum  of  terms  whose  magnitudes  are  image-  or  geometry- 
dependent  and  are  therefore  not  necessarily  commensurate.  One  therefore  needs  to  scale  them 
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appropriately,  that  is  to  define  the  A  weights  so  as  to  make  the  magnitude  of  their  contributions 
commensurate  and  independent  of  the  specific  reidiometry  or  geometry  of  the  scene  under  consid¬ 
eration. 

From  Equation  4,  it  can  be  seen  that  the  dynamics  of  the  optimization  are  controlled  by  the 
gradient  of  the  objective  function.  As  a  result,  we  have  found  that  an  effective  way  to  normalize  the 
contributions  of  the  various  components  of  the  objective  function  is  to  define  a  set  of  user-specified 
weights  A'-  such  that 

E  a;  <  1 . 

1<I<TI 

These  weights  are  then  used  to  define  the  As  as  follows 

A,  =  _ L 

II  II 

Ad  =  -  (5) 

II  ^£DiS^)  II 
Ai,  =  ^(Ea:) 

i 

where  /„,  is  a  monotonically  decreasing  function  that  approaches  zero  as  AJ-  approaches  one  and 
is  the  surface  estimate  at  the  start  of  each  optimization  step.  In  our  implementation,  we  take 
fu,  (x)  =  ((1  —  x)/x)^  so  that  the  regularization  term  has  the  same  influence  as  the  sum  of  all  the 
others  when  AJ  =  0-5  We  first  proposed  this  normalization  scheme  in  [Pua  and  Leclerc,  1990], 
and  it  is  analogous  to  standard  constrained  optimization  techniques  in  which  the  various  constraints 
are  scaled  so  that  their  eigenvalues  have  comparable  magnitudes  [Luenberger,  1984].  In  practice  we 
have  found  that,  because  the  normalization  makes  the  influence  of  the  various  terms  comparable 
irrespective  of  actual  radiometry  or  dimensions,  the  user-specified  A'-  weights  are  context-specific  but 
not  image-specific.  In  other  words,  we  use  one  set  of  parameters  for  images  of  faces  when  combining 
stereo,  shape-from-shading,  and  silhouettes,  and  another  when  dealing  with  aerial  images  of  terrain 
using  stereo  and  3-D  point  constraints,  but  we  do  not  have  to  change  them  for  different  faces 
or  different  landscapes.  In  our  appendix,  we  use  synthetic  data  to  illustrate  the  behavior  of  our 
weighting  scheme  and  its  robustness,  and  in  Section  3  we  demonstrate  its  effectiveness  in  practice. 

The  continuation  method  of  Section  2.1  is  implemented  by  taking  the  initial  value  of  53,  AJ  to 
be  0.5  and  then  progressively  decreasing  it  while  keeping  the  relative  values  of  the  A's  constant. 
We  demonstrate  our  method’s  behavior  using  the  aerial  images  of  Figure  3  and  evaluate  our  results 
against  the  “ground  truth”  supplied  to  us  by  a  photogrammetrist  from  Ohio  State  University.  In 
this  example,  we  initialize  a  coarse  resolution  mesh  by  interpolating  a  correlation  map  derived  using 
the  images  reduced  by  a  factor  of  four.  We  first  apply  our  continuation  method  to  this  coarse  mesh 
using  the  stereo  component  of  the  objective  function  that  is  introduced  in  Section  2.4.  Next,  as 
discussed  in  Section  2.1,  we  increase  the  resolution  of  both  the  images  and  the  mesh,  reoptimize  and 
repeat  the  process  once  more.  At  each  level  of  resolution,  as  Ap  decreases,  the  discrepancy  between 
our  surface  model  and  the  control  points  diminishes.  In  Figure  4(a,b,c),  we  show  the  corresponding 
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Figures:  A  test  dataset  (courtesy  ofOhioState  University):  (a, b)  An  aerial  stereo  pair.  (c,d)  Matched 
pair  of  points  hand-entered  by  a  photogrammetrist.  (e)  Shaded  view  of  the  triangulated 
surface  formed  by  the  corresponding  3-D  points. 

optimized  meshes.  In  Figure  4(d),  we  plot  the  RMS^  distance  of  the  control  points  to  the  surface 
at  the  end  of  each  optimization  step.  The  final  error  at  each  level  of  resolution,  denoted  by  the 
thick  vertical  lines,  corresponds  to  an  error  in  measured  disparity  that  is  smaller  than  half  a  pixel. 
Given  the  fact  that  the  control  points  are  not  necessarily  perfect  themselves,  this  is  the  kind  of 
performance  one  would  expect  of  a  precise  stereo  system  [Giielch,  1988]. 

However,  the  real  strength  of  our  approach  lies  in  the  fact  that  it  allows  us  to  combine  image- 
based  constraints  such  as  stereo  with  geometric  constraints  such  as  the  ones  introduced  below, 
thereby  making  the  reconstruction  more  robust  in  difficult  situations. 

Note  that  the  photogrammetrist  generated  more  control  points  in  comparatively  high-relief 
areas  of  the  images  of  Figure  3(a,b)  so  that  their  triangulation,  shown  in  Figure  3(c),  forms  an 
irregular  mesh  or  TIN^.  As  shown  in  [Mclnerney  and  Terzopoulos,  1993,  Koh  ei  al.,  1994],  the 
optimization  of  such  irregular  meshes  can  be  achieved  using  a  finite-element  method.  Our  whole 
approach  could  therefore  be  extended  to  such  irregular  meshes  and  this  will  be  the  subject  of  future 
work. 

^Root  Mean  Square 
^THangular  Irregular  Network 
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Figure  4:  Behavior  of  the  continuation  method  of  Section  2.1;  (a,b,c)  Shaded  views  of  the  recon¬ 
structed  surface  at  each  level  of  resolution.  At  the  coarsest  level  the  images  are  110x64  in 
size  and  the  mesh  vertices  form  a  24x23  array.  To  go  from  one  level  to  the  next,  the  image 
dimensions  are  doubled  and  each  mesh  facet  is  subdivided  into  four,  (d)  A  plot  of  the  RMS 
distance,  in  meters,  of  the  control  points  of  Figure  3(c,d)  to  the  surface  as  the  optimization 
proceeds.  The  thick  vertical  lines  indicate  a  change  in  resolution  and  the  dotted  ones  an 
increase  by  0.1  of  the  stereo  weight  A^t  and  corresponding  decrease  in  the  Ap  regularization 
weight.  At  the  highest  resolution,  an  elevation  error  of  0.2  meter  corresponds  to  an  error  of 
approximately  0.4  pixel  in  disparity. 


2.3  Geometric  Constraints 

We  have  explored  the  constraints  generated  by  3-D  points,  3-D  linear  features,  and  2-D  silhouettes. 
2.3.1  3— D  Points 

3-D  Points  are  treated  as  attreictors  and  3-D  linear  features  are  taken  to  be  collections  of  such 
points.  The  easiest  way  to  handle  attractors  is  to  model  each  one  as  a  spring  by  adding  the  following 
term  to  the  objective  function 

Co  =  l/2((a;a  -  xi)^  +  iUa  -  yif  +  (Za  -  Xi)^)  (6) 

where  and  Zi  are  the  coordinates  of  the  mesh  vertex  closest  to  the  attractor  (lod/aiXo)- 

This,  however,  is  inadequate  if  one  wishes  to  use  facets  that  are  large  enough  so  that  attracting 
the  vertices,  as  opposed  to  the  surface  point  closest  to  the  attractor,  would  cause  unwarranted 
deformations  of  the  mesh.  This  is  especially  important  when  using  a  sparse  set  of  attreictors.  In 
this  case,  the  energy  term  of  Elquation  6  must  be  replaced  by  one  that  attracts  the  surface  without 
warping  it.  In  our  implementation,  this  is  achieved  by  redefining  Co  as 

e„  =  l/2dl  (7) 
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(a)  (b)  (c)  (d) 


Figure  5:  3-D  and  2-D  point  constraints:  (a)  Point  attractor  modeled  as  a  spring  attached  to  a  vertex. 

(b)  Point  attractor  modeled  as  a  spring  attached  to  the  closest  surface  point,  (c)  Occlusion 
contours  are  the  locus  of  the  projections  of  the  surface  points  for  which  a  camera 

ray  is  tangential  to  the  surface,  (d)  In  practice,  the  (uj.u,)  projection  of  such  a  point 
must  be  colinear  with  the  projections  of  the  vertices  of  the  facet  that  produces  the  observed 
silhouette  edge. 

where  da.  is  the  orthogonal  distance  of  the  attractor  to  the  closest  facet.  The  normal  vector  to  a 
facet  can  be  computed  as  the  normalized  cross  product  of  the  vectors  defined  by  two  sides  of  that 
facet,  and  da  as  the  dot  product  of  this  normal  vector  with  the  vector  defined  by  one  of  the  vertices 
and  the  attractor.  Letting  (a:,,  yi,  ^,)i<i<3  be  the  three  vertices  of  a  facet,  consider  the  polynomial 
D  defined  as 

xi  yt  Zi  1 

X2  yz  zz  1 

D  = 

Xs  Us  Zs  1 

Ua  Za  1 

=  Cj;X  +  Cyy  +  CzZ 

where  Cx,Cy,  and  are  polynomial  functions  of  and  z;.  It  is  easy  to  show  that  the  facet 
normal  is  parallel  to  the  vector  {Cx,Cy,Cx)  and  that  the  square  of  the  orthogonal  distance  d^  of 
the  attractor  to  the  facet  can  be  computed  as 

d'^  =  DyiC^,  +  Cl  +  Cl) 

Finding  the  “closest  facet”  to  an  attractor  is  computationally  expensive  in  general.  However,  in 
our  specific  case  the  search  can  be  made  efficient  and  fast  if  we  assume  that  the  3-D  points  can 
be  identified  by  their  projection  in  an  image.  We  project  the  mesh  in  that  image,  generate  the 
corresponding  Facet-ID  image — which  must  be  done  in  any  case  for  other  computations — and  look 
up  the  facet  number  of  the  point’s  projection.  This  applies,  for  example,  to  range  maps,  edge-  or 
correlation-based  stereo  data,  and  hand-entered  features  that  can  be  overlaid  on  various  images. 
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We  typically  recompute  the  facet  attachments  at  every  iteration  of  the  optimization  procedure  so 
as  to  allow  facets  to  slide  as  necessary.  Since  the  points  can  potentially  come  from  any  number  of 
such  images,  this  method  can  be  used  to  fuse  3-D  data  from  different  sources. 

2.3.2  Silhouettes 

Contrary  to  3-D  edges,  silhouette  edges  are  typically  2-D  features  since  they  depend  on  the  view¬ 
point  and  cannot  be  matched  across  images.  However,  as  shown  in  Figure  5(c),  they  constrain 
the  surface  tangent.  Each  point  of  the  silhouette  edge  defines  a  line  that  goes  through  the  optical 
center  of  the  camera  and  is  tangent  to  the  surface  at  its  point  of  contact  with  the  surface.  The 
points  of  a  silhouette  edge  therefore  define  a  ruled  surface  that  is  tangent  to  the  surface.  In  terms 
of  our  facetized  representation,  this  can  be  expressed  as  follows.  Given  a  silhouette  point  («,,  in 
an  image,  there  must  be  a  facet  with  vertices  (i,,  y,-,  ^,‘)i<i<3  whose  image  projections  (u;, 
as  well  as  (uaiVs),  all  lie  on  a  single  line  as  depicted  by  Figure  5(d).  This  implies  that  the  three 
determinants  of  the  form 

Ui  Uj  Ua 

Vi  Vj  V,  ,  1  <  t  <  3, :  <  j  <  3 
1  1  1 

must  be  equal  to  zero.  We  enforce  this  for  each  silhouette  point  by  adding  to  the  objective  function 
a  term  of  the  form 

2 

U{  Uj  u, 

e,  =  1/2  ^  Vi  Vj  Va  (8) 

l<i<3,i<j<3  111 

where  the  (ui,t;,)s  are  derived  from  the  (ij,y,-,z;)  using  the  camera  model. 

As  with  the  3-D  attractors  described  in  Section  2.3.1,  the  main  problem  is  to  find  the  “sil¬ 
houette  facet”  to  which  the  constraint  applies.  Since  the  silhouette  point  (ustW,)  can  lie  outside 
the  projection  of  the  current  estimate  of  the  surface,  we  search  the  Facet-ID  image  in  a  direction 
normal  to  the  silhouette  edge  for  a  facet  that  minimizes  e,  and  that  is  therefore  the  most  likely  to 
produce  the  silhouette  edge.  This,  in  conjunction  with  our  coarse-to-fine  optimization  scheme,  has 
proved  a  robust  way  of  determining  which  facets  correspond  to  silhouette  points. 

2.4  Image  Constraints 

In  this  work,  we  use  two  complementary  image-based  constraints:  stereo  and  shape-from- 
shading. 

The  stereo  component  of  the  objective  function  is  derived  by  comparing  the  gray  levels  of  the 
points  in  all  of  the  images  for  which  the  projection  of  a  given  point  on  the  surface  is  visible,  as 
determined  using  the  Facet-ID  image.  This  comparison  is  done  for  a  uniform  sampling  of  the 
surface,  as  shown  in  Figure  2(b).  This  method  allows  us  to  deal  with  arbitrarily  slanted  regions 
and  to  discount  occluded  areas  of  the  surface. 

The  shading  component,  depicted  in  Figure  2(c),  of  the  objective  function  is  computed  using  a 
method  that  does  not  invoke  the  traditional  assumption  of  constant  albedo.  Instead,  it  attempts 
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Figure  6:  Recovering  the  shape  of  a  face  by  combining  stereo  and  shape-from-shading:  (a)  (b)  (c) 
Triplet  of  face  images  (courtesy  of  INRIA).  (d)  Disparity  map.  (e)  (f)  (g)  Shaded  views  of 
the  reconstructed  surface  after  optimization,  (h)  The  recovered  albedo  map. 


to  minimize  the  variation  in  albedo  across  the  surface,  and  can  therefore  deal  with  both  constant 
albedo  surfaces  as  well  as  surfaces  whose  albedo  varies  slowly. 

Stereo  information  is  very  robust  in  textured  regions  but  potentially  unreliable  elsewhere.  We 
therefore  use  it  mainly  in  textured  areas  by  weighting  the  stereo  component  most  strongly  for  facets 
of  the  triangulation  that  project  into  textured  image  areas.  Conversely,  the  shading  information  is 
more  reliable  where  there  is  little  texture  and  is  weighted  accordingly. 

These  two  terms  are  central  to  our  approtich:  they  are  the  ones  that  allow  the  combination  of 
geometric  information  with  image  information.  However,  since  their  behavior  and  implementation 
have  already  been  extensively  discussed  elsewhere,  we  do  not  describe  them  any  further  here  and 
refer  the  interested  reader  to  our  previous  publication  [Fua  and  Leclerc,  1994].  In  Figure  6,  we 
show  the  reconstruction  of  a  face  using  only  stereo  and  shape-from -shading. 


11 


3  Applications 

Our  framework  allows  us  to  combine  geometric  constraints  with  image-based  constraints  to  de¬ 
rive  surface  reconstructions  and  to  refine  previously  computed  surfaces.  Here,  we  demonstrate  its 
capabilities  using  difficult  imagery. 

3.1  FVom  3-D  Constraints  to  Detailed  Surfaces 

Our  system  deals  with  the  various  sources  of  3-D  information,  whether  dense,  such  as  range  maps 
or  correlation-based  stereo  disparity  maps,  or  linear,  such  as  hand-entered  features  or  edge-based 
stereo  disparity  maps,  in  the  same  fashion.  Both  are  sampled  at  regular  intervals  to  generate 
collections  of  3-D  attractors  that  are  used  to  define  energy  terms  using  Equation  6  or  7. 

Especially  in  the  case  of  sparse  features,  the  “snake- type”  optimization  technique  of  Section  2.1 
has  proved  more  effective  than  more  classical  techniques  such  as  conjugate  grcidient  at  propagating 
constraints  across  the  mesh. 

3.1.1  Dense  3-D  Data 

In  Figure  7,  we  show  an  image  of  a  face  and  a  corresponding  range  map  computed  using 
structured  light.  Although  it  is  fairly  accurate,  this  particular  method  introduces  artifacts  that  are 
highlighted  in  Figure  7(c).  We  first  fit  a  surface  to  these  points  by  starting  from  a  flat  surface  and 
taking  the  total  energy  Et  of  Equation  1  to  be 

£t  —  (9) 

=  -I- Ayi 

a 

where  the  Cq  are  defined  for  each  range-data  point  as  the  attraction  terms  of  Elquation  7.  Because  of 
the  artifacts  of  the  original  range  data,  the  resulting  surface  is  approximately  correct  but  excessively 
wrinkly,  as  shown  in  Figure  7(d)  and  (e).  Of  course,  we  could  simply  smooth  the  surface  but  we 
would  then  be  at  risk  of  losing  important  details  such  as  the  mouth  or  the  fine  structures  on  the 
side  of  the  nose.  Our  approach  provides  us  with  a  better  way  of  dealing  with  this  problem:  we  can 
fuse  the  range  information  with  the  shading  information  of  the  intensity  image  of  Figure  7(a).  To 
do  so,  we  tidd  to  Ex  the  shziding  term  defined  in  Section  2.4,  that  we  denote  Esh'- 

+  ^A^A  +  ^Sh^Sh‘ 

We  restart  the  optimization  from  the  flat  initial  surface.  The  new  surface,  shown  in  Figure  8,  is 
much  smoother,  but  the  mouth  is  well  preserved  and  the  side  of  the  nose  better  defined.  Note, 
however,  that  in  the  side  views  the  bottom  of  the  nose  is  not  flat  enough.  This  is  not  surprising 
since  the  shading  information  is  of  no  use  there.  We  <iddress  this  problem  in  Section  3.2. 

3.1.2  Sparse  3-D  Data 
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Figure  7:  Fitting  a  surface  to  range  data:  (a)  Image  of  a  face  (courtesy  of  ETH  Zurich),  (b)  Corre¬ 
sponding  range  image  computed  using  structured  light,  (c)  A  window  of  the  range  image 
in  which  gray  levels  have  been  stretched  to  emphasize  the  vertical  wrinkles  and  the  his¬ 
togram  of  a  horizontal  slice,  (d)  (e)  Shaded  views  of  the  surface  reconstructed  by  using  the 
range-data  points  as  attractors,  (f)  The  corresponding  albedo  map. 


We  now  turn  to  sparse  3-D  data.  In  Figure  9,  we  show  a  stereo  pair  of  a  rock  outcrop  forming 
an  almost  vertical  cliff.  Note  that,  even  though  the  geometry  is  almost  epipolar,  these  two  images 
are  very  hard  to  fuse  both  for  humans  and  for  automated  procedures.  In  Figure  9(c),  we  show  the 
output  of  a  correlation  result  [Fua,  1993]  that  gives  no  information  about  the  shape  of  the  outcrop. 
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Figure  8:  Combining  range-data  with  shape-from-shading  information;  (a)(b)(c)  Shaded  views  of  the 
refined  reconstruction  of  the  face  of  Figure  7  using  shading. 

This  can  be  attributed  to  the  fact  that,  in  the  cliff  area,  the  fundamental  assumption  underlying 
correlation-based  stereo  using  a  fixed-shape  window  is  violated:  the  depth  is  not  constant  within 
a  correlation  window.  To  demonstrate  the  data-fusion  capabilities  of  our  approach,  we  supply  the 
3-D  edges  whose  projectiohs  are  shown  in  Figure  9(d)  and  (e).  To  do  so,  we  have  used  the  3-D 
snakes  [Fua  and  Leclerc,  1990]  that  are  embedded  in  the  SRI  Cartographic  Modeling  Environment 
(CME)  [Quam  and  Strat,  1991]:  rough  contours  are  hand-entered  and  treated  as  the  projections 
of  polygonal  3-D  curves  whose  x,y,  and  z  coordinates  are  then  optimized  to  maximize  the  average 
edge  strength  along  the  projections.  Alternatively,  we  could  take  advantage  of  the  output  of  3-D 
edge  detectors  such  as  those  described  in  [Ayache  and  Lustman,  1987,  Robert  and  Faugeras,  1991, 
Ma  and  Thonnat,  1992,  Meygret  et  a/.,  1990]. 

By  using  the  energy  term  of  Equation  9,  we  attract  an  initially  flat  surface  to  both  the  stereo 
data  and  the  3-D  outlines  and  produce  a  shape  estimate  that  is  roughly  correct  but  much  too 
smooth,  as  can  be  seen  in  Figure  10(b)  and  (c). 

By  adding  either  the  stereo  term  alone  to  Figure  10(d),  or  both  the  stereo  and  shading 
terras,  Figures  10(e)  and  (f),  we  can  generate  a  much  more  realistic  model  of  the  surface.  Note  that 
in  Figure  10(e)  the  cracks  in  the  right  side  of  the  outcrop  are  well  modeled.  Our  object-centered 
representation  has  no  trouble  accommodating  the  sharply  slanted  surfaces. 

In  Figure  11,  we  show  another  application  of  our  technique  in  a  semiurban  environment  using 
images  of  a  model  board.  We  have  used  the  3-D  snakes  to  outline  some  of  the  linear  features 
visible  in  the  images.  We  then  generate  the  rough  estimate  of  the  surface  shape  of  Figure  12(b), 
and  improve  it  using  stereo  as  shown  in  Figure  12(c).  In  addition,  we  have  used  CME  to  model 
the  buildings  as  extruded  objects.  We  exploit  them  to  mask  out  occluded  areas  when  computing 
the  stereo  energy.  This  is  achieved  naturally  in  our  system  by  using  the  projections  of  the  building 
models  in  each  view  to  zero  out  the  corresponding  Facet-ID  image.  In  this  way,  the  facet  samples 
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(c)  (d)  (e) 


Figure  9:  Semiautomated  cartography  of  a  rugged  site;  (a)  (b)  A  hard-to-fuse  stereo  pair  of  a  rock 
outcrop  with  an  almost  vertical  cliff,  (c)  Disparity  map.  Within  the  outcrop  the  correlation- 
based  algorithm  provides  almost  no  information;  outside  of  it  the  terrain  is  almost  hat.  (d) 

(e)  The  projections  of  a  few  3-D  features  outlined  using  3-D  snakes. 

that  project  at  these  locations  are  discounted  during  the  computation  of  the  stereo  energy  defined 
in  Section  2.4.  Since  buildings  cannot  be  very  well  described  by  our  smooth  mesh,  ignoring  those 
pixels  amounts  to  assuming  that  the  terrain  is  smooth  below  the  buildings  and  prevents  the  surface 
from  wrinkling  unduly. 

3.2  Refining  Previously  Derived  Models 

So  far,  we  have  shown  how  our  technique  can  be  used  to  generate  surface  models  “from  scratch.” 
However,  very  few  vision  algorithms — ours  being  no  exception — consistently  provide  a  perfect  an¬ 
swer  across  scenes  using  a  predetermined  set  of  information  sources  and  analysis  parameters.  For 
applications  such  as  cartography  or  3-D  graphics,  it  is  often  important  to  be  able  to  easily  refine  a 
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Figure  10:  Combining  3-D  constraints  with  stereo  and  shape-from-shading:  (a)  The  recovery  of  the 
terrain  for  the  aerial  scene  of  Figure  9  starts  with  a  flat  surface  that  is  attracted  by  the  3-D 
outlines  and  the  3-D  cloud  of  points  corresponding  to  the  disparity  map.  (b)  (c)  Shaded 
views  of  the  reconstructed  surface  using  only  those  constraints,  (d)  Refinement  using  stereo, 
(e)  (f)  Refinement  using  both  stereo  and  shape-from-shading. 


previously  derived  result,  such  as  an  old  DEM  or  the  output  of  a  fully  automated  procedure,  using 
additional  clues.  This  can  be  done  using  both  3-D  contours  and  silhouettes. 

We  start  with  an  example  involving  the  two  aerial  images  of  Figure  13,  at  the  top  of  which  is 
a  very  sharp  cliff  that  casts  shtidows  on  the  ground.  Starting  from  a  coarse  and  inaiccurate  DEM, 
we  generate  the  surf«u:e  shown  in  Figure  13(e),  using  stereo  alone.  By  computing  the  disparities 
associated  with  that  improved  model,  we  have  visually  checked  that  it  is  correct  except  in  the 
immediate  vicinity  of  the  cliff,  where  it  is  too  smooth.  This  should  be  expected:  our  objective 
function  Cj  includes  a  smoothness  term,  and  the  face  of  the  cliff  is  not  visible  in  those  images  and 
therefore  provides  no  stereo  clues.  By  sketching  the  edge  of  the  cliff  and  the  shadows  with  our 
3-D  snakes  and  using  them  to  add  an  attraction  term  to  the  objective  function,  we  can  deform 
the  surface  slightly  to  produce  the  result  shown  in  Figure  13(f)  where  the  ridge  is  better  defined. 
To  further  check  the  validity  of  our  result,  we  have  used  the  known  sun  direction  to  predict  which 
parts  of  the  ground  are  in  shadow.  To  do  this  we  generate  a  sun  view,  that  is  an  orthographic 
view  as  seen  from  the  sun’s  viewpoint,  and  the  corresponding  Facet-ID  image.  For  every  f«u:et. 
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Figure  11:  Semiautomated  cartography  of  a  semiurban  site:  (a)  (b)  (c)  Three  images  taken  with  differ¬ 
ent  light  source  directions,  (d)  (e)  (f)  Projections  of  hand-entered  3-D  linear  features  and 
building  blocks.  Note  that  the  bases  of  the  buildings  extend  below  the  ground. 


Figure  12:  CombiningS-Dconstraintsand  visibility  constraints  with  stereo:  (a)  The  3-D  linear  features 
of  Figure  11  above  the  fiat  plane  used  as  the  initial  surface  estimate,  (b)  A  rough  estimate  of 
the  ground-level  surface  (c)  Surface  after  optimization  using  both  stereo  and  hand-entered 
buildings  to  mask  occluded  areas. 


we  compute  the  proportion  of  samples  that  are  visible  in  this  sun  view  as  shown  in  Figure  13(g). 
The  facets  for  which  a  large  proportion  of  samples  is  occluded  are  those  in  shadow.  As  can  be 
seen,  these  shadowed  facets  match  the  aictual  shadows  fairly  well,  which  leads  us  to  believe  that 
our  reconstruction  is  accurate. 

Silhouettes  are  also  very  good  indicators  of  the  quality  of  a  reconstruction.  For  example,  the 
reconstruction  of  the  bottom  of  the  nose  in  Figure  8  is  not  quite  right  as  evidenced  by  its  silhouette 
in  the  side  view  of  the  same  man  shown  in  Figure  14.  However,  we  can  use  the  silhouette  constraints 
of  Section  2.3.2  with  the  two  silhouettes  shown  in  the  figure.  The  silhouettes  are  2-D  curves  that 


Figure  13:  Improving  and  checking  a  DEM:  (a)  (b)  An  aerial  stereo  pair  of  a  cliff  with  clearly  vis¬ 
ible  shadows,  (c)  (d)  The  cliff’s  ridge  and  cast  shadows  outlined  using  3-D  snakes,  (e) 
Reconstructed  surface  using  stereo  alone,  (f)  Reconstructed  surface  using  both  stereo  and 
the  3-D  outlines  as  attractors,  (g)  Predicted  shadow  areas  in  black.  The  prediction  was 
carried  out  using  the  reconstruction  shown  in  (f)  and  the  known  sun  direction.  Note  that 
these  hypothesized  shadows  closely  match  the  actual  ones.  Note  also  that,  were  we  to  use 
the  original  reconstruction  shown  in  (e)  to  perform  this  computation,  no  shadows  would  be 
predicted  because  the  surface  is  too  smooth. 


have  been  outlined  using  2-D  snakes.  In  the  manner  of  Section  3.1.1,  we  take  the  total  energy 
to  be 

+  ^Sh^Sh 

£s  = 

3 

where  the  are  the  silhouette  attraction  terms  of  Equation  8  and  £sh  the  shading  term  described 
in  Section  2.4.  We  use  these  terms  to  deform  the  nose  region  and  generate  the  improved  result 
shown  in  Figure  14(c). 

The  face  reconstruction  of  Figure  6  presents  us  with  a  slightly  different  problem.  We  have  used 
a  correlation-based  stereo  algorithm  to  provide  us  with  an  initial  estimate.  This  algorithm  gave 
us  no  information  on  the  sharply  slanted  parts  of  the  face,  which  are  therefore  missing  from  the 
reconstruction.  The  silhouettes  of  the  face,  however,  are  clearly  visible  in  Figure  15  and  easy  to 
outline.  To  take  advantage  of  these,  we  again  use  a  coarse-to-fine  strategy.  We  start  with  a  larger 
and  coarser  mesh  that  evolves  under  the  influence  of  the  silhouettes  and  the  vertices  of  the  original 
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Figure  14:  Using  silhouettes  to  improve  a  reconstruction:  (a)  The  face  of  Figure  7  with  a  silhouette 
at  the  bottom  of  the  nose  outlined,  (b)  A  side  view  of  the  same  face  with  a  second  nose 
silhouette,  (c)  Shaded  views  of  the  refined  reconstruction  using  both  shading  and  the  two 
silhouettes. 

reconstruction  that  are  treated  as  attractors.  When  the  mesh  has  been  refined  and  optimized,  we 
complete  the  optimization  procedure  by  turning  on  the  full  objective  function: 

=  Aufi)  +  Xs^s  +  XshSsh  +  XstCsi , 

where  and  Ssi  denote  the  shading  and  stereo  terms  presented  in  Section  2.4.  The  results  are 
shown  in  Figure  15(c), (d)  and  (e). 

The  silhouettes  used  in  the  two  examples  above  have  been  entered  semiautomatically.  But 
here  again,  we  could  take  advantage  of  automatically  extracted  ones  [Cipolla  and  Blake,  1990, 
Liedtke  et  al.,  1991,  Vaillant  and  Faugeras,  1992]. 

4  Conclusion 

We  have  presented  a  surface  reconstruction  method  that  uses  an  object-centered  representation  to 
recover  3-D  surfaces.  Our  method  uses  both  monocular  shading  cues  and  stereoscopic  cues  from 
any  number  of  images  while  correctly  handling  self-occlusions.  It  can  also  take  advantage  of  the 
geometric  constraints  derived  from  measured  3-D  points  and  2-D  silhouettes.  These  complementary 
sources  of  information  are  combined  in  a  unified  manner  so  that  new  ones  can  be  added  easily  as 
they  become  available. 
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Figure  15:  Using  silhouettes  to  expand  the  scope  of  our  method:  (a)  (b)  (c)  Silhouettes  of  the  face 
in  the  three  views  of  Figure  6  outlined  using  2-D  snakes,  (d)  (e)  (f)  Shaded  views  of 
reconstructed  surface  after  optimization  using  stereo,  shading,  and  the  constraints  provided 
by  the  silhouettes. 


Using  a  variety  of  real  imagery,  we  have  demonstrated  that  the  resulting  method  is  quite  pow¬ 
erful  and  flexible,  allowing  for  both  completely  automatic  reconstruction  in  straightforward  cir¬ 
cumstances,  and  for  user-assisted  reconstruction  in  more  complex  circumstances.  User  assistance  is 
provided  primarily  through  the  introduction  and  identification  of  a  small  number  of  hand-entered 
linear  and  point  features  using  semi-automated  “snake”  technology.  The  method  is  also  controlled 
by  a  small  number  of  parameters  that  specify  the  relative  importance  of  the  various  information 
sources.  These  parameters  typically  do  not  need  to  be  adjusted  for  images  within  a  given  class 
(such  as  face  images  or  high-altitude  aerial  images),  but  only  across  classes. 
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The  method  has  valuable  capabilities  for  applications  such  as  3-D  graphics  model  generation 
and  high-resolution  cartography  in  which  a  human  can  select  the  sources  of  information  to  be  used 
and  their  relative  importance.  For  example,  in  the  case  of  mapping,  one  must  ensure  that  the 
terrain  model  conforms  to  the  feature  data  and  does  not  violate  any  physical  constraints;  roads 
should  be  on  the  ground  and  not  overly  tilted,  streams  should  stay  within  stream  beds,  buildings 
should  not  be  floating  in  space,  and  so  on.  Our  method  allows  one  to  both  satisfy  these  constraints 
and  account  as  well  as  possible  for  the  observed  image  data. 

In  future  work,  we  will  study  in  a  more  quantitative  manner  the  influence  of  the  various  terms 
of  our  objective  function  and  their  relative  weights.  This  will  require  the  use  of  ground-truth 
and  carefully  controlled  conditions.  We  plan  to  set  up  a  facility  that  will  allow  us  to  acquire 
the  necessary  data.  We  will  also  strive  to  replace  some  of  the  hand-entered  geometric  cues  by 
automatically  extracted  ones  and  to  investigate  more  complex  topologies  than  the  ones  shown  here. 
A  principled  way  to  do  so  would  be  to  rephrase  our  modeling  task  as  one  of  finding  the  “best” 
description  of  a  scene  in  terms  of  the  Minimum  Description  Length  (MDL)  principle  [Rissanen, 
1987,  Leclerc,  1989a,  Fua  and  Hanson,  1991].  It  can  be  shown  that  the  objective  function  that 
we  propose  here  can  be  reformulated  in  terms  of  the  MDL  principle.  After  optimization  using 
stereo  and  shape  from  shading,  the  surface  ought  to  provide  the  best  possible  compromise  between 
simplicity  of  description  of  the  surface  and  fit  to  the  image  data  in  terms  of  the  simple  vocabulary  of 
triangulated  meshes.  The  extensions  that  we  have  described  above  allow  us  to  enrich  the  vocabulary 
by  adding  new  primitives — ridges,  building,  roads,  and  so  on — that  allow  an  even  more  effective 
description.  This  approach  would  give  us  a  principled  way  to  accept  or  reject  new  objects  in  our 
overall  representation. 
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Appendix:  Robustness  of  the  constraint  weighting  scheme 

In  Section  2.2,  we  proposed  a  weighting  scheme  for  the — in  general  noncommensurate — components 
of  the  objective  function  of  Ekjuation  1.  In  this  appendix,  we  use  a  specific  example  to  illustrate 
the  ability  of  our  method  to  combine  stereo  constraints  and  externally  supplied  3-D  and  2-D 
geometric  constraints  in  the  presence  of  noise  and  the  relative  insensitivity  of  our  procedure  to 
parameter  settings. 


Figure  A.l:  Three  synthetic  images  generated  by  texture  mapping  the  image  of  a  face  onto  the  hemi¬ 
spheric  surface  shown  in  Figure  A. 2  as  seen  from  three  different  viewpoints. 


Figure  A.2:  (a)  Hemispheric  surface  used  to  generate  the  images  of  Figure  A.l  and  taken  to  be  the 
“ground  truth”  for  the  experiments  described  in  this  appendix  (b)  The  3-D  geometric  con¬ 
straints  are  in  the  form  of  25  regularly  spaced  3-D  points  lying  on  the  hemisphere,  shown 
as  white  crosses,  some  of  which  are  occluded.  {c,d)  The  2-D  geometric  constraints  are  in 
the  form  of  two  silhouette  edges  shown  as  thick  white  lines. 

The  images  used  here  are  shown  in  Figure  A.l.  They  have  been  generated  by  texture  mapping 
the  image  of  a  face  onto  the  hemispheric  surface  of  Figure  A.2(a)  as  seen  from  three  different 
viewpoints.  We  take  the  3-D  geometric  constraints  to  be  given  by  a  set  of  25  regularly  spaced  3-D 
attractors  lying  on  the  hemisphere  and  shown  in  Figure  A.2(b).  The  2-D  constraints  are  given  by 
the  two  occluding  contours  of  Figures  A.2(c)  and  (d).  We  can  therefore  write  the  total  objective 
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function  of  Ekjuation  1  as 


St{S)  =  Xd£d{S)  +  S{S) 

S^S)  =  \si£stiS)  +  Xa£a{S)  +  Xsi:s{S)  . 

where  £st  is  the  stereo  term,  £a  fhe  sum  of  the  3-D  attraction  terms  of  Ekjuation  7,  and  £s  the 
sum  of  the  silhouette  attraction  terms  of  Equation  8.  At  the  start  of  each  optimization  step,  the 
Xi  coefficients  are  recomputed  according  to  Elquation  5  using  a  set  of  user-supplier  that  specify 
the  relative  importance  of  the  various  terms. 

Here  we  study  the  influence  of  the  user-supplied  weights  and  A^,  on  the  distance 

between  the  surface  reconstructed  by  minimizing  Et  and  the  “ground  truth”  surface  of  Figure 
A.2(a). 

For  each  setting  of  the  parameters,  in  Figures  A.3,  A.4,  and  A.5,  we  plot  four  curves  corre¬ 
sponding  to  four  different  amounts  of  Gaussian  white  noise — of  respective  variance  2.5%,  5.0%, 
7.5%  and  10%  of  the  images’dynamic  range — added  to  the  images  to  degrade  the  stereo  term.  The 
curves  were  obtained  by  averaging  the  results  of  several  trials,  all  starting  from  a  randomized  flat 
surface  and  utilizing  our  continuation  method  with  five  increasing  values  of 

=  A5J  +  A^  -1-  A5, 

the  sum  of  the  user-supplied  weights,  ranging  from  0.5  to  0.9.  As  in  Figure  4,  the  graphs  represent 
the  RMS  reconstruction  error  at  the  end  of  each  optimization  step  as  a  function  of  Sx.  In  this  set  of 
experiments,  we  allowed  only  the  2  coordinates  of  the  vertices  to  vary.  We  also  fixed  the  boundary 
vertices  so  as  to  eliminate  the  effect  of  the  gray-level  discontinuities  at  the  border  between  the 
texture- mapped  part  of  the  images  and  their  black  background. 

The  error  is  measured  by  the  difference  in  elevation  between  the  reconstructed  vertices  and  the 
elevation  they  would  have  if  they  were  on  the  actual  “ground  truth”  surface  of  Figure  A.2(a).  Note 
that  the  difference  in  elevation  between  the  top  and  the  bottom  of  the  hemisphere  is  34  units  of 
elevation  and  that  an  error  of  1  unit  of  elevation  corresponds  to  a  difference  in  computed  disparities 
of  approximately  0.25  pixel  for  projections  from  the  image  of  Figure  A. 1(a)  into  those  of  Figure 
A.l(b)  and  (c). 

In  Figure  A.3,  we  show  the  behavior  of  our  continuation  method  using  stereo  alone,  that  is 
taking  A^  and  A^  to  be  zero.  In  Figure  A.3(a),  we  draw  as  solid  lines  the  four  curves  derived  using 
all  three  noisy  images  at  the  same  time  and  in  Figure  A.3(b,c)  those  obtained  using  only  two  images 
at  a  time.  For  comparison’s  sake,  we  also  plot  as  dashed  lines  the  curves  computed  using  noise-free 
images.  As  the  abscissa  is  traversed  rightwards,  Sx  =  A^j  increases  and  A^  decreases,  resulting  in 
curves  having  the  same  shape  as  that  of  Figure  4.  Note,  however,  that  for  the  higher  noise  values, 
the  best  result  is  not  achieved  for  the  largest  value  of  Sx  but  for  one  slightly  smaller.  As  discussed 
in  Section  2,  in  the  presence  of  noise,  smoothing  is  required  to  prevent  the  surface  from  overfitting 
the  data. 

In  Figure  A.4,  we  plot  the  equivalent  curves  for  different  values  of  A^  and  A^.  Figures  A.4(a) 
and  (b)  illustrate  the  use  of  the  3-D  point  constraints  along  with  three-image  stereo.  Graph  (a) 
was  generated  using  A^  =  0.45a,  X'st  =  0.65a  and  graph  (b)  using  X'^  =  0.25a,  A^j  =  0.85a.  In 
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Figure  A.3:  Continuation  method  using  stereo  alone:  plot  of  elevation  errors  as  a  function  of  the  reg¬ 
ularization  parameters  A'^^,  using  all  three  images  simultaneously  (a),  using  only  images  1 
and  2  (b),  and  using  only  images  1  and  3  (c).  The  dashed  curve  corresponds  to  noise-free 
images  and  the  four  solid  curves  to  increasing  amounts  of  white  noise  being  added  to  the 
images.  Using  all  three  images  yields  substantially  better  results  than  any  of  the  pairs. 


Figure  A.4:  Combining  noise-free  constraints  with  three-image  stereo:  (a)  Using  heavily  weighted  3-D 
point  constraints,  (b)  Using  less  heavily  weighted  3-D  point  constraints  (c)  Using  heavily 
weighted  2-D  silhouette  constraints,  (d)  Using  less  heavily  weighted  2-D  silhouette  con¬ 
straints. 


other  words,  the  geometric  constraint  is  weighted  more  heavily  in  the  first  case  than  in  the  second. 
As  expected,  in  the  absence  of  noise  the  results  are  indistinguishable  from  those  of  Figure  A.3(a). 
However,  in  the  presence  of  noise,  the  constraints  consistently  improve  the  outcome.  Since  the  3-D 
points  lie  exactly  on  the  constraint  surface,  the  improvement  is  larger  when  the  3-D  constraint  is 
weighted  more  heavily.  The  same  effect  can  be  observed  by  using  the  2~D  silhouette  constraints 
with  =  0.45a)  ^'st  =  0.65a,  graph  (c),  and  A'^  =  0.25a,  ^si  ~  0-85a,  graph  (d).  The  average 
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improvement  is  not  as  large  because  the  silhouette  constraints  are  more  localized  but  the  observed 
trends  are  similar. 


Figure  A. 5:  Noisy  3-D  constraints  using  the  same  parameters  as  in  Figure  A. 4:  (a)  (b)  The  elevation  of 
the  attactors  has  been  randomized  by  cidding  noise  of  variance  1.  (c)  (d)  The  elevation  of 
the  attactors  has  been  randomized  by  adding  noise  of  variance  2. 

Note,  however,  that  the  constraints  used  above  were  “perfect”  in  the  sense  that  the  3-D  points 
lie  exactly  on  the  “ground  truth”  surface.  This  is  not  realistic  in  general  as  there  always  will  be 
some  imprecision.  In  Figure  A. 5,  we  show  the  result  of  rerunning  the  same  experiments  as  before, 
after  having  randomized  the  elevation  of  the  3-D  attractors.  Since  the  precision  of  the  constraints 
has  now  degraded,  their  use  yields  an  improvement  over  stereo  alone  only  when  enough  noise  has 
been  added  to  the  images  so  that  the  reliability  of  the  stereo  term  is  less  than  that  expected  of  the 
constraints,  and  this  independently  of  the  exact  weights  chosen. 

We  have  shown  that,  on  a  specific  example,  our  method  for  combining  the  constraints  is  robust 
in  the  presence  of  noise.  The  exact  numbers  we  obtain  may  change  slightly  but  the  overall  behavior 
of  the  optimization  procedure  is  fairly  constant  for  different  settings  of  the  user-supplied  weights 
and  yields  intuitively  satisfactory  results. 

Because  of  the  extreme  complexity  of  the  image  potentials,  a  full  mathematical  treatment  of  the 
behavior  of  the  objective  function  is  beyond  the  scope  of  this  paper.  However,  in  practice,  we  have 
observed  the  same  relative  invariance  of  the  results  with  respect  to  changes  of  parameter  settings. 
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